Conversation

@armenzg (Member) commented Oct 30, 2025

Snuba can sometimes raise errors when we're trying to delete events from the Nodestore.

The changes here will retry the task for certain of those errors.
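The overall pattern this PR introduces can be sketched as follows. This is a minimal, self-contained illustration, not the actual Sentry code: `RetryTask`, `DeleteAborted`, and `UnqualifiedQueryError` mirror names that appear in this PR's snippets, but the wiring, function name, and the set of retryable errors are illustrative assumptions.

```python
# Illustrative sketch of the PR's error handling, NOT the real Sentry code.
# RetryTask / DeleteAborted / UnqualifiedQueryError mirror names from the
# PR snippets; RETRYABLE_SNUBA_ERRORS is an assumed stand-in.

class RetryTask(Exception):
    """Ask the task framework to re-run this task."""

class DeleteAborted(Exception):
    """Abort: retrying cannot help."""

class UnqualifiedQueryError(Exception):
    """Raised by Snuba when a query references missing projects."""

RETRYABLE_SNUBA_ERRORS = (TimeoutError, ConnectionError)  # assumption

def delete_events_from_nodestore(fetch_event_ids):
    """Best-effort deletion: fetch event IDs from Snuba, handling errors."""
    try:
        return fetch_event_ids()
    except UnqualifiedQueryError as error:
        message = error.args[0] if error.args else ""
        if "no longer exist" in message:
            # Project was deleted concurrently: permanent, complete quietly.
            return None
        raise DeleteAborted(f"{message}. We won't retry this task.") from error
    except RETRYABLE_SNUBA_ERRORS as error:
        error_type = type(error).__name__
        raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error
```

The key design split: transient Snuba failures surface as `RetryTask` (re-run later), while permanent conditions either abort or complete normally.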

@armenzg armenzg self-assigned this Oct 30, 2025
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 30, 2025
"deletions.nodestore.retry",
tags={"type": f"snuba-{type(snuba_error).__name__}"},
sample_rate=1,
)
Contributor:

Bug: Test Fails Due to Unexpected Error Handling

The test_snuba_errors_retry test expects UnqualifiedQueryError("All project_ids from the filter no longer exist") to trigger a retry. However, the code specifically handles this error by logging an info metric and completing normally, which causes the test to fail and contradicts the behavior in test_deletion_with_all_projects_deleted.
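For illustration, here is a hypothetical sketch of the handler behavior the bot describes (the names mirror this PR's snippets, but the function itself is not the real code): the exact "all projects deleted" message is treated as normal completion, while any other `UnqualifiedQueryError` aborts without retry, so a test expecting a retry for that message cannot pass.

```python
# Hypothetical sketch of the handler the bot describes; not the real code.
ALL_PROJECTS_DELETED = "All project_ids from the filter no longer exist"

class UnqualifiedQueryError(Exception):
    pass

class DeleteAborted(Exception):
    pass

def handle_unqualified_query_error(error):
    message = error.args[0] if error.args else ""
    if message == ALL_PROJECTS_DELETED:
        # Logged as info and treated as normal completion -- no retry,
        # which is why a test expecting a retry here fails.
        return "completed"
    raise DeleteAborted(f"{message}. We won't retry this task.") from error
```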


codecov bot commented Oct 30, 2025

❌ 2 Tests Failed:

Tests completed: 41439 | Failed: 2 | Passed: 41437 | Skipped: 254
View the top 2 failed test(s) by shortest run time
tests.sentry.deletions.tasks.test_nodestore.NodestoreDeletionTaskTest::test_unqualified_query_error
Stack Traces | 3.8s run time
.../deletions/tasks/test_nodestore.py:123: in test_unqualified_query_error
    with pytest.raises(DeleteAborted):
E   Failed: DID NOT RAISE <class 'sentry.exceptions.DeleteAborted'>
tests.sentry.deletions.tasks.test_nodestore.NodestoreDeletionTaskTest::test_snuba_errors_retry
Stack Traces | 3.88s run time
.../deletions/tasks/test_nodestore.py:159: in test_snuba_errors_retry
    with pytest.raises(RetryError):
E   Failed: DID NOT RAISE <class 'sentry.taskworker.retry.RetryError'>

To view more test analytics, go to the Test Analytics Dashboard


# TODO: Add specific error handling for retryable errors and raise RetryTask when appropriate
except Exception:
metrics.incr(f"{prefix}.error", tags={"type": "unhandled-exception"}, sample_rate=1)
Member Author:

This metric let me see that something changed in the last couple of days and led me to this Sentry issue.

[image attachment]

# This is not a transient error - retrying won't help since the project is permanently gone.
logger.info("All project_ids from the filter no longer exist")
# There may be no value to track this metric, but it's better to be safe than sorry.
metrics.incr(f"{prefix}.info", tags={"type": "all-projects-deleted"}, sample_rate=1)
Member Author:

Reducing the metric from warning to info.

metrics.incr(f"{prefix}.warning", tags={"type": "all-projects-deleted"}, sample_rate=1)
# When deleting groups, if the project gets deleted concurrently (e.g., by another deletion task),
# Snuba raises UnqualifiedQueryError with the message "All project_ids from the filter no longer exist".
# This happens because the task tries to fetch event IDs from Snuba for a project that no longer exists.
Member Author:

We have to remember that we delete from the Nodestore on a best-effort basis, since the events will eventually expire via their TTL anyway.

Alternatively, we could have the nodestore tasks delete the project themselves, rather than doing it in the spawning task.

metrics.incr(f"{prefix}.error", tags={"type": "unqualified-query-error"}, sample_rate=1)
# Report to Sentry to investigate
raise DeleteAborted(f"{error.args[0]}. We won't retry this task.") from error
metrics.incr(f"{prefix}.error", tags={"type": type(error).__name__}, sample_rate=1)
Member Author:

Using type(error).__name__ instead of unqualified-query-error to distinguish the errors in Datadog.

logger.warning(
f"{prefix}.retry", extra={**extra, "error_type": error_type, "error": str(error)}
)
raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error
Member Author:

This is the new feature in this PR. I believe the errors above can be retried (we have an upper bound on how many times we will retry).
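The "upper bound" on retries can be pictured with a toy driver loop. This is purely illustrative: `run_with_retries` and `max_retries` are assumed names for this sketch, not Sentry's actual task framework.

```python
# Toy sketch of bounded retries; the real task framework is more involved.

class RetryTask(Exception):
    """Signal that the task should be re-run."""

def run_with_retries(task, max_retries=3):
    """Run `task`, re-running it when it raises RetryTask,
    at most `max_retries` extra times before giving up."""
    attempt = 0
    while True:
        try:
            return task(attempt)
        except RetryTask:
            attempt += 1
            if attempt > max_retries:
                raise  # upper bound reached; stop retrying
```

A transient failure that clears after a couple of attempts succeeds; a persistent one eventually surfaces the final `RetryTask` instead of looping forever.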

) as error:
error_type = type(error).__name__
metrics.incr(f"{prefix}.retry", tags={"type": f"snuba-{error_type}"}, sample_rate=1)
logger.warning(
Member Author:

We don't trigger a Sentry event every time this happens, since the retried task may succeed.

raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error

except Exception as error:
metrics.incr(f"{prefix}.error", tags={"type": type(error).__name__}, sample_rate=1)
Member Author:

Using type(error).__name__ instead of unhandled-exception to distinguish in Datadog.

